Member 1: Dhruv Dewan, Contribution: 90% (did not contribute to G).
Member 2: Anish Nandyala, Contribution: 90% (did not contribute to B).
Member 3: Aditya Prashanth, Contribution: 90% (did not contribute to C).
"We, all team members, agree together that the above information is true, and we are confident about our contributions to this submitted project/final tutorial."
Dhruv Dewan, Anish Nandyala, Aditya Prashanth - 05/07/2024
Dhruv Dewan - I worked together with Anish and Aditya on most sections of this final tutorial. The areas where my contribution was most critical are the creation of the LSTM model and its layers, the creation and visualization of the stocks and technical indicators, and several of the financially backed explanations.
Anish Nandyala - I worked on all sections of this final tutorial except data curation. I spent the most time on the exploratory data analysis section and on brainstorming the hypothesis tests and correlation tests to use. I also contributed to the visualization of the ML model's loss plots.
Aditya Prashanth - I worked on a variety of sections across the project. My biggest contributions were in determining the features and creating the training and testing datasets used for the model, creating an initial regression model, and writing the insights and conclusion for the project.
In today's fast-paced financial environment, navigating the stock market's continuously changing waves can feel like high-stakes gambling. With fortunes climbing and falling in a split second, investors constantly seek the keys to predictability and profitability. The concept of Exchange-Traded Funds (ETFs) has provided a diversified avenue for investment, offering exposure to a basket of assets within a single fund. Among these, the Magnificent Seven ETF (from Roundhill Investments) stands out, comprised of seven tech giants: Apple, Amazon, Meta (Facebook), Alphabet (Google), Tesla, Nvidia, and Microsoft. It is well known that these and other tech stocks are very volatile and prone to large swings. With the risk this volatility brings comes great potential for profit as well. This is where our motivation for the project stems from.
Our project aims to utilize the power of machine learning to discover the hidden features of stock prediction. At its core, we aim to address a fundamental question: Can we leverage historical stock data, technical indicators, and other features to forecast the future performance of the Magnificent Seven ETF with machine learning?
In an industry where volatility rules the market, the ability to anticipate price movements with a degree of accuracy holds great value. Successful predictions empower investors to make informed decisions, mitigate risks, and capitalize on opportunities for profit. Furthermore, in the realm of ETFs, where the fortunes of multiple companies are intertwined, the stakes are even higher, and the potential rewards even more enticing. To achieve our goal, we will dive into the realm of technical analysis, a cornerstone of financial forecasting. By identifying and utilizing key technical indicators, widely recognized within the world of trading, we aim to construct a framework for predicting the future performance of the Magnificent Seven ETF. From moving averages to relative strength index (RSI), these indicators offer valuable insights into market trends, momentum, and sentiment, serving as the foundation upon which our predictive models will be built.
In summary, our project not only aims to create our own stock prediction model for the MAG7, but also to demonstrate the potential of machine learning in revolutionizing investment strategies.
# Libraries
import pandas as pd
import numpy as np
import yfinance as yf
import scipy
import warnings
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping
# import plotting tools
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
# interactive plot stuff
import plotly.graph_objects as go
import plotly
# technical analysis
import pandas_ta as ta
In Python, the yfinance library serves as a valuable tool for accessing financial data from Yahoo Finance. The download function from yfinance is used to retrieve historical market data for the Magnificent Seven ETF by its ticker symbol 'MAGS'. The period specified is 1 year, because the MAG7 ETF has only existed since last April, and for the purposes of our project a full year of data makes for clean comparisons. Upon execution, the resulting dataset, representing the Magnificent Seven ETF's historical market activity, is stored in a pandas DataFrame. We get a succinct preview of the first few rows using .head(), to gain insight into the structure and content of the retrieved data.
Using the following link, you can read the documentation of the yfinance library and its API: https://pypi.org/project/yfinance/
mag7_data = yf.download('MAGS', period='1y')
[*********************100%%**********************] 1 of 1 completed
We first check the count of rows in our dataframe.
print(mag7_data.count())
Open         252
High         252
Low          252
Close        252
Adj Close    252
Volume       252
dtype: int64
Below we clean our data by dropping NA values and duplicates, just in case to have clean data prepped for comparisons, testing, and analysis.
mag7_data = mag7_data.dropna()
mag7_data = mag7_data.drop_duplicates()
The plot below displays the adjusted close prices of the Magnificent Seven ETF over one year, illustrating its performance over time. We choose the Adjusted Close price to visualize because it accounts for splits, dividend distributions, and other corporate actions that can affect the stock. This metric provides a smooth, consistent view of the ETF's performance, allowing for straightforward comparison and analysis of the investment's returns. By focusing on adjusted close prices, the plot emphasizes the ETF's overall performance, capturing both capital appreciation and dividend distributions within the one-year timeframe.
plt.figure(figsize=(14,5))
sns.set_style("ticks")
sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green')
sns.despine()
plt.title("Adjusted Close of Magnificent Seven ETF over time", size='medium', color='black')
Text(0.5, 1.0, 'Adjusted Close of Magnificent Seven ETF over time')
Below we describe our dataframe, which provides a concise statistical summary of the Magnificent Seven ETF's historical market data. It includes essential metrics like count, mean, standard deviation, minimum, maximum, and quartiles for the adjusted close prices. This summary offers quick insights into the distribution and characteristics of the ETF's performance, aiding in analysis and decision-making for investment strategies.
mag7_data.describe()
| | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|
| count | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 252.000000 | 2.520000e+02 |
| mean | 32.922198 | 33.137563 | 32.626996 | 32.883524 | 32.799326 | 8.497460e+04 |
| std | 3.693590 | 3.719676 | 3.598968 | 3.673566 | 3.726901 | 1.248657e+05 |
| min | 25.882000 | 26.129999 | 25.882000 | 26.030001 | 25.917496 | 5.000000e+02 |
| 25% | 30.204999 | 30.288751 | 29.908249 | 30.172500 | 30.042091 | 4.375000e+03 |
| 50% | 31.625000 | 31.835000 | 31.353500 | 31.604000 | 31.467403 | 3.140000e+04 |
| 75% | 36.417500 | 36.697500 | 36.000000 | 36.429998 | 36.429998 | 1.306250e+05 |
| max | 40.450001 | 40.493999 | 40.029999 | 40.380001 | 40.380001 | 1.282400e+06 |
Below we download the stock data for TSM (Taiwan Semiconductor Manufacturing Company).
We download stock data for TSM with the intention of exploring its relationship with Nvidia's and Apple's stock performance. This correlation is particularly important because Nvidia and Apple both source their chips from TSM. By analyzing the stock data of this company, we aim to discover patterns, trends, and potential interactions between their respective stock prices. Understanding this relationship can offer valuable insights for our prediction, since a significant portion of the Magnificent Seven's weighting comes from Nvidia and Apple.
In addition to TSM, we downloaded data on Emerson Electric (EMR), a key supplier to Tesla, and we aim to utilize that correlation as part of the model as well.
Lastly, we have downloaded Intel data due to their involvement in the semiconductor industry. Several Mag 7 companies may be directly influenced by Intel's movements, and we can test this correlation in our EDA.
tsm_data = yf.download('TSM', period='1y')
emr_data = yf.download('EMR', period='1y')
intel_data = yf.download('INTC', period='1y')
[*********************100%%**********************] 1 of 1 completed [*********************100%%**********************] 1 of 1 completed [*********************100%%**********************] 1 of 1 completed
We proceed to clean the rest of these stock dataframes before any analysis.
tsm_data = tsm_data.dropna()
tsm_data = tsm_data.drop_duplicates()
intel_data = intel_data.dropna()
intel_data = intel_data.drop_duplicates()
emr_data = emr_data.dropna()
emr_data = emr_data.drop_duplicates()
Now that we have data that is ready to be used for analysis, we can start exploring this data to find high level relationships between the features of each dataset, and find comparisons between the different datasets we have chosen.
First, we want to check how the MAGS stock compares to the TSM stock due to the relationship mentioned above between Apple/Nvidia and TSM. We can first start off by checking how different these two distributions are. We can accomplish this by using a two-sample T-test.
Using the following link, you can read the documentation of the scipy library and its API: https://docs.scipy.org/doc/scipy/
Our hypotheses are listed below:
Null Hypothesis: There is no significant difference between the average adjusted close prices of the MAGS and TSM stocks over the last year.
Alternative Hypothesis: The average adjusted close prices of the stocks MAGS and TSM significantly diverge from each other over the last year.
t_stat, p_value = scipy.stats.ttest_ind(mag7_data['Adj Close'], tsm_data['Adj Close'])
print("T-statistic:", t_stat)
print("P-Value: ", p_value)
T-statistic: -60.47918730129428 P-Value: 1.1760013176125386e-232
combined_data = pd.concat([mag7_data['Adj Close'], tsm_data['Adj Close']], axis=1)
combined_data.columns = ['MAGS', 'TSM']
plt.figure(figsize=(10, 6))
sns.set_style("whitegrid")
sns.boxplot(data=combined_data)
plt.title('Comparison of Adjusted Close Prices: MAGS vs TSM')
plt.xlabel('Stock')
plt.ylabel('Adjusted Close Price')
plt.show()
We reject the null hypothesis, as the p-value is far below our significance level (alpha = 0.05); the boxplot above visualizes this difference in distributions. The hypothesis test simply shows that there is a large difference in the valuation of each stock. To gain more insight, we can move on to a correlation between the stocks to see whether their movements are similar. If they are correlated in their movement, the TSM stock can be a useful feature, as we can track it to gain insight into the value of the Magnificent Seven ETF.
Next, we can move on to see instead how the trends of these two stocks correlate over the past year. Let's first visualize the trends of each of the stocks individually.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
plt.figure(figsize=(14,5))
sns.set_style("ticks")
sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
sns.lineplot(data=tsm_data, x="Date", y='Adj Close', color='purple', label='TSM')
plt.title('Adjusted Close Prices of MAGS and TSM: 05-2023 to 05-2024')
plt.legend()
plt.show()
By just looking at this graph, we cannot visually see much correlation between the two stocks. This may be because the adjusted close prices for TSM vary over a wider range during the year than the MAGS adjusted close prices. Let's do a more in-depth analysis using Pearson's correlation coefficient to see the relationship between the adjusted close prices of the two stocks.
mags_close = mag7_data['Adj Close']
tsm_close = tsm_data['Adj Close']
correlation = np.corrcoef(mags_close, tsm_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=tsm_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, tsm_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs TSM')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('TSM Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.9396902858140396
The high correlation coefficient obtained from a Pearson correlation test between the prices of Magnificent Seven and TSM (Taiwan Semiconductor Manufacturing Company) suggests a strong linear relationship between the two stocks. In the context of the Magnificent Seven movement, this high correlation could indicate several interconnected factors.
Firstly, it reflects common industry trends that heavily influence the semiconductor sector, such as technological advancements, shifts in demand for electronic devices, and broader economic conditions. Secondly, it may reflect the intricate supply chain dependencies within the semiconductor industry, given TSM's role as a significant manufacturer of chips for various companies, including Nvidia and Apple. Changes in TSM's performance or production capacity could significantly impact Nvidia's operations, as a result affecting investor sentiment for both firms. Additionally, the correlation could mirror shared market sentiment towards the semiconductor industry as a whole, where positive or negative news may affect both Nvidia and TSM's stock prices concurrently.
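A single full-period coefficient can hide regime changes. One way to probe the stability of this relationship (not part of the original analysis; shown here on synthetic stand-in series rather than the real mag7_data and tsm_data) is a rolling-window correlation:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real Adj Close series (the notebook itself
# would use mag7_data['Adj Close'] and tsm_data['Adj Close']).
dates = pd.date_range("2023-05-08", periods=252, freq="B")
trend = np.linspace(26, 40, 252)
mags = pd.Series(trend + np.sin(np.arange(252) / 10), index=dates)
tsm = pd.Series(3 * trend + np.cos(np.arange(252) / 10), index=dates)

# 30-day rolling Pearson correlation: values near +1 throughout the
# year indicate a stable relationship, not one driven by a single regime.
rolling_corr = mags.rolling(30).corr(tsm)
print(rolling_corr.dropna().round(2).tail())
```

If the rolling correlation stays high across the year, TSM is more defensible as a predictive feature than a single aggregate coefficient alone would suggest.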
We can evaluate the relationship between the MAGS stock and Emerson Electric, Co. (EMR) in the same way. We start by graphing the adjusted close data for the past year of both of these stocks.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
plt.figure(figsize=(14,5))
sns.set_style("ticks")
sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
sns.lineplot(data=emr_data, x="Date", y='Adj Close', color='red', label='EMR')
plt.title('Adjusted Close Prices of MAGS and EMR: 05-2023 to 05-2024')
plt.legend()
plt.show()
Looking at this graph, we can see a slight similarity in the trends of both of the stocks. We can further confirm this by finding the Pearson's correlation coefficient once more.
mags_close = mag7_data['Adj Close']
emr_close = emr_data['Adj Close']
correlation = np.corrcoef(mags_close, emr_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=emr_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, emr_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs EMR')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('EMR Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.8851207432962013
The high correlation coefficient between the Magnificent Seven ETF and EMR (Emerson Electric Co.), a key Tesla supplier, indicates a strong statistical relationship between their performances. This suggests that changes in the Mag 7's value are closely associated with corresponding changes in EMR's value.
This correlation could arise from many factors such as their business relationship, shared industry trends, and market changes. For us, understanding this correlation offers insights into predicting the Mag 7, with other industry knowledge that can be gained from the performance of EMR.
with warnings.catch_warnings():
warnings.simplefilter("ignore")
plt.figure(figsize=(14,5))
sns.set_style("ticks")
sns.lineplot(data=mag7_data, x="Date", y='Adj Close', color='green', label='MAGS')
sns.lineplot(data=intel_data, x="Date", y='Adj Close', color='purple', label='INTC')
plt.title('Adjusted Close Prices of MAGS and Intel: 05-2023 to 05-2024')
plt.legend()
plt.show()
mags_close = mag7_data['Adj Close']
intel_close = intel_data['Adj Close']
correlation = np.corrcoef(mags_close, intel_close)[0, 1]
print("Pearson's correlation coefficient:", correlation)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=mags_close, y=intel_close)
slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(mags_close, intel_close)
x_values = np.linspace(min(mags_close), max(mags_close), 100)
y_values = slope * x_values + intercept
plt.plot(x_values, y_values, color='black', label=f'Linear Regression (R={correlation:.2f})')
plt.title('Scatter Plot of Adjusted Close Prices: MAGS vs Intel')
plt.xlabel('MAGS Adjusted Close Price')
plt.ylabel('Intel Adjusted Close Price')
plt.show()
Pearson's correlation coefficient: 0.5282295025081082
The moderate correlation coefficient of roughly 0.53 between the Magnificent Seven ETF (Mag 7) and Intel suggests a moderately positive relationship. While their stock prices tend to move somewhat together, the correlation is not strong, implying that their price movements are not closely synchronized. Overall, after this test we conclude that Intel may not be an insightful feature to use when predicting the ETF.
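One caveat worth noting: correlations computed on the price levels of trending series can be inflated by a shared drift. A common robustness check, sketched below on deterministic synthetic series (hypothetical stand-ins for the real data, not the actual notebook values), is to correlate daily returns via pct_change instead:

```python
import numpy as np
import pandas as pd

# Two synthetic price series that share an upward trend but have
# unrelated day-to-day oscillations (deterministic, for reproducibility).
n = 252
t = np.arange(n)
a = pd.Series(30 + 0.05 * t + np.sin(t / 5.0))  # stand-in for MAGS
b = pd.Series(40 + 0.08 * t + np.cos(t / 3.0))  # stand-in for INTC

level_corr = a.corr(b)                              # inflated by the common trend
return_corr = a.pct_change().corr(b.pct_change())   # measures co-movement only

print(f"levels: {level_corr:.2f}, returns: {return_corr:.2f}")
```

If the level correlation is high but the return correlation is near zero, the apparent relationship is mostly shared drift rather than synchronized movement.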
We are now calculating technical indicators using the pandas_ta library. This library streamlines the process of calculating these indicators, giving us more time to concentrate on getting insights and creating visualizations. Technical analysis involves scrutinizing historical market data, such as price and volume, to forecast future price movements. Technical indicators are mathematical calculations derived from market data (in this case price of Magnificent Seven ETF), that are used to help traders and analysts in making informed trading decisions.
The indicators we're adding encompass a range of values that represent different parts of the stock and market. Moving Average Convergence/Divergence (MACD) highlights the relationship between two moving averages, indicating trend direction. Relative Strength Index (RSI) measures the speed and change of price movements which can be used in identifying overbought or oversold stocks. By calculating these indicators, we can understand the market dynamics, potentially spotting trading opportunities much better. These technical indicators, if informative enough, can be used as features in our stock prediction model.
Using the following link, you can read the documentation of the pandas_ta library and its API: https://github.com/twopirllc/pandas-ta
mag7_data.ta.macd(append=True) # Moving Average Convergence/Divergence
mag7_data.ta.rsi(append=True) # Relative Strength Index
Date
2023-05-08 NaN
2023-05-09 NaN
2023-05-10 NaN
2023-05-11 NaN
2023-05-12 NaN
...
2024-05-01 49.234011
2024-05-02 53.678298
2024-05-03 58.522552
2024-05-06 61.645198
2024-05-07 60.260892
Name: RSI_14, Length: 252, dtype: float64
Below we plot the RSI indicator to visualize the state of the ETF.
plt.plot(mag7_data['RSI_14'], label='RSI')
plt.axhline(y=70, color='r', linestyle='--', label='Overbought (70)')
plt.axhline(y=30, color='g', linestyle='--', label='Oversold (30)')
plt.legend()
plt.xticks(rotation=45, fontsize=8)
plt.title('RSI Indicator')
Text(0.5, 1.0, 'RSI Indicator')
Here the Relative Strength Index (RSI) falls within the normal range, which suggests a balanced market sentiment without extreme bullishness or bearishness: there is no extreme movement in either direction, indicating a stable price. In such instances, the RSI value usually oscillates between 30 and 70. A normal RSI indicates a stable trend and suggests the market is in a state of equilibrium, with neither side exerting overwhelming pressure. Since the RSI here is not extreme, it may not be especially useful in a prediction for this ETF.
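To make the RSI computation concrete, here is a simplified sketch using simple 14-day means of gains and losses; note that pandas_ta's RSI_14 uses Wilder's exponential smoothing, so exact values differ slightly:

```python
import numpy as np
import pandas as pd

def simple_rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """RSI via simple rolling means of gains and losses.
    pandas_ta's rsi() uses Wilder's smoothing instead, so this
    sketch only approximates the RSI_14 column."""
    delta = close.diff()
    gains = delta.clip(lower=0).rolling(window).mean()
    losses = (-delta.clip(upper=0)).rolling(window).mean()
    rs = gains / losses                     # relative strength
    return 100 - 100 / (1 + rs)             # bounded in [0, 100]

# Toy price path oscillating around a level, so RSI stays mid-range
prices = pd.Series(32 + np.sin(np.arange(60) / 4.0))
rsi = simple_rsi(prices)
print(rsi.dropna().round(1).tail())
```

A series of uninterrupted gains drives the RSI to 100 (overbought), while uninterrupted losses drive it to 0 (oversold).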
Calculating the Moving Average Convergence Divergence (MACD) provides valuable insight into the Magnificent Seven ETF's historical market performance. The MACD is computed by taking the difference between two Exponential Moving Averages (EMAs), typically a shorter-term EMA (12-day) and a longer-term EMA (26-day). This calculation highlights momentum and trend changes in the ETF's price movements over time. Interpreting the MACD involves following its relationship with a signal line, which is usually a 9-day EMA of the MACD line itself. When the MACD line crosses above the signal line, it indicates a bullish crossover, suggesting a potential uptrend in the market and in the ETF's price. On the other hand, when the MACD line crosses below the signal line, it signals a bearish crossover, indicating a downtrend.
Note: A moving average is a calculation used to smooth out changes in data by maintaining a constantly updated average of recent historical prices or values over a specific time period. It helps identify trends by reducing noise or random fluctuations in the data, making it easier to visualize the underlying direction of the trend. In finance, moving averages are commonly applied to stock prices to analyze price movements over time and identify potential entry or exit points in the market. Because the moving average smooths out noise, it provides a very insightful lagging numerical feature to add to our model.
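The MACD construction described above can be reproduced directly with pandas' ewm, which serves as a sanity check against pandas_ta's MACD_12_26_9, MACDs_12_26_9, and MACDh_12_26_9 columns (a sketch; EMA initialization details may differ slightly between implementations):

```python
import numpy as np
import pandas as pd

def macd(close: pd.Series, fast=12, slow=26, signal=9):
    """MACD line, signal line, and histogram from exponential moving
    averages. EMA initialization may differ slightly from pandas_ta."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow            # trend momentum
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    histogram = macd_line - signal_line        # crossover indicator
    return macd_line, signal_line, histogram

# A flat price series has no momentum: MACD and histogram are zero
flat = pd.Series(np.full(60, 32.0))
line, sig, hist = macd(flat)
print(line.iloc[-1], hist.iloc[-1])  # prints: 0.0 0.0
```

A bullish crossover corresponds to the histogram flipping from negative to positive, which is exactly what the green/red bars in the chart below encode.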
plotly.offline.init_notebook_mode()
fig = go.Figure()
fig.add_trace(go.Scatter(x=mag7_data.index, y=mag7_data['MACD_12_26_9'], mode='lines', name='MACD'))
fig.add_trace(go.Scatter(x=mag7_data.index, y=mag7_data['MACDs_12_26_9'], mode='lines', name='Signal'))
fig.add_trace(go.Bar(x=mag7_data.index, y=mag7_data['MACDh_12_26_9'], name='MACD Histogram', marker_color=['green' if val >= 0 else 'red' for val in mag7_data['MACDh_12_26_9']]))
# Customize the chart
fig.update_xaxes(rangeslider=dict(visible=False))
fig.update_layout(plot_bgcolor='#efefff', font_family='Monospace', font_color='#000000', font_size=20,width=1000)
fig.update_layout(title="MACD chart for Magnificent Seven ETF")
fig.show()
In this analysis of the MACD, we observe a bullish crossover between the MACD line and the signal line, a sign of an uptrend in the market. This suggests an upward trend in the ETF's future price movement. In a situation like this, we are presented with an opportunity to invest and grow profit, as we may consider entering or holding positions in the ETF in anticipation of further price growth. By examining the historical MACD plot, we can gain valuable insights into past trend changes and momentum shifts of the fund. Overall, this MACD analysis enhances our understanding of the Magnificent Seven ETF's market behavior.
Now that we have insight on the important features and correlations in our data, we can start our primary analysis. From our previous analysis, we have concluded that we want to use the adjusted closing prices of Emerson Electric Co. (EMR), Taiwan Semiconductor Manufacturing Co. (TSM), and Intel (INTC), as well as the MACD of the Mag 7 stock, to predict the adjusted closing stock price of the Mag 7 stock itself. To do this we have decided to use a LSTM model.
Long Short-Term Memory models (LSTMs) are often used in stock analysis. One of the main strengths of these models is their ability to capture and learn from long-term dependencies in sequential data. This is useful for our application, as the model makes use of year-long stock data and will need to capture patterns and trends that may not be initially obvious throughout the data, learning from them in order to sufficiently predict the future stock price. These types of models are particularly well suited to time-series data.
Let's start by compiling the features we want to use in our model into one dataframe.
model_data = pd.DataFrame({'Adj Close': mag7_data['Adj Close'],
'intel close': intel_data['Adj Close'],
'tsm close': tsm_data['Adj Close'],
'emr close': emr_data['Adj Close'],
'mag7 macd': mag7_data['MACD_12_26_9']})
model_data = model_data.dropna()
The above code puts all the relevant features, which are the Intel adjusted closing price, the TSM adjusted closing price, the EMR adjusted closing price, and the MACD for Mag7 in the model data. It also includes the label, which is the MAGS adjusted closing price, which is what we are trying to predict, as the first column.
We then move on to splitting the data into our training and testing data. We use 1 month of the year-long data as our test data, and the rest as our training data. We also scale all the data to the range between 0 and 1. This is done to ensure all features are weighted equally by the model, so as not to give undue importance to any one feature over the others.
test_start_date = model_data.index[-1] - pd.DateOffset(months=1)
train_data = model_data.loc[model_data.index < test_start_date]
test_data = model_data.loc[model_data.index >= test_start_date]
# Scale the data for model
scaler = MinMaxScaler(feature_range=(0, 1))
train_data_scaled = scaler.fit_transform(train_data)
test_data_scaled = scaler.transform(test_data)
n_steps = 4
n_features = len(model_data.columns)
# Creating Sequences
def create_sequences(data, n_steps):
X, y = [], []
for i in range(len(data) - n_steps):
X.append(data[i:i + n_steps])
y.append(data[i + n_steps, 0]) # Assuming the target is in the first column
return np.array(X), np.array(y)
X_train, y_train = create_sequences(train_data_scaled, n_steps)
X_test, y_test = create_sequences(test_data_scaled, n_steps)
X_train = X_train.reshape((X_train.shape[0], n_steps, n_features))
X_test = X_test.reshape((X_test.shape[0], n_steps, n_features))
Above, we split the data into testing and training data, scaled the data, and also split the label from the features, formatting the features accordingly.
Now we can get into building the model.
# Creating and compilation of the LSTM model
model = Sequential()
model.add(LSTM(50, return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(50, return_sequences=True))
model.add(LSTM(50))
model.add(Dense(1)) # Output layer with 1 neuron for regression
model.compile(optimizer='adam', loss='mean_squared_error')
c:\Users\dhruv\anaconda3\envs\320_final\Lib\site-packages\keras\src\layers\rnn\rnn.py:204: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.
We have created a Sequential model with an input layer, hidden LSTM layers, and 1 output Dense layer. The first layer takes in the inputs of shape (n_steps, n_features), and each subsequent layer has 50 neurons. The final output layer compiles the values from the previous layers into one final output. We use Adam as our optimizer, and we use Mean Squared Error to compute the loss for each training epoch.
After this, we must train our model.
# Training
early_stopping = EarlyStopping(monitor='val_loss', patience=7)
# Note: this callback only takes effect if passed to model.fit via callbacks=[early_stopping]
history = model.fit(X_train, y_train, epochs=90, batch_size=64, verbose=2, validation_data=(X_test[7:], y_test[7:]))
epoch_loss = history.history['loss']
epoch_val_loss = history.history['val_loss']
Epoch 1/90 4/4 - 4s - 897ms/step - loss: 0.2199 - val_loss: 0.6842
Epoch 2/90 4/4 - 0s - 14ms/step - loss: 0.1527 - val_loss: 0.5007
Epoch 3/90 4/4 - 0s - 13ms/step - loss: 0.0790 - val_loss: 0.2749
Epoch 4/90 4/4 - 0s - 12ms/step - loss: 0.0270 - val_loss: 0.0811
Epoch 5/90 4/4 - 0s - 12ms/step - loss: 0.0363 - val_loss: 0.0630
...
Epoch 88/90 4/4 - 0s - 13ms/step - loss: 0.0023 - val_loss: 0.0047
Epoch 89/90 4/4 - 0s - 13ms/step - loss: 0.0023 - val_loss: 0.0049
Epoch 90/90 4/4 - 0s - 14ms/step - loss: 0.0023 - val_loss: 0.0056
We train the model for 90 epochs with a batch size of 64. We hold out part of the data as a validation set so we can monitor how well the model generalizes during training, and we record the training loss and validation loss at every epoch.
Now our model is ready to be used! We predict the outputs for the test data as follows.
# Predictions
train_predict = model.predict(X_train)
test_predict = model.predict(X_test)
7/7 ━━━━━━━━━━━━━━━━━━━━ 1s 48ms/step
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step
We generate predictions with the trained model on both the training and testing data, saving the results in the train_predict and test_predict variables respectively. We can now compare these predictions to the ground truth values to see how well our model did.
We have one more step, which is to convert all our data back to its original format by undoing the scaling and reshaping the data.
# Conversion of data back into its original scale to view
# The scaler was fit on all n_features columns, so to invert a single
# column we pad the predictions with zeros for the remaining features,
# inverse-transform, and keep only the first column (the adjusted close).
zeros_array = np.zeros((train_predict.shape[0], n_features-1))
train_predict_combined = np.hstack((train_predict, zeros_array))
train_predict = scaler.inverse_transform(train_predict_combined)[:, 0]

zeros_array = np.zeros((test_predict.shape[0], n_features-1))
test_predict_combined = np.hstack((test_predict, zeros_array))
test_predict = scaler.inverse_transform(test_predict_combined)[:, 0]

# Undo the scaling on the ground-truth targets in the same way
zeros_array_train = np.zeros((y_train.shape[0], n_features-1))
zeros_array_test = np.zeros((y_test.shape[0], n_features-1))
y_train_combined = np.hstack((y_train.reshape(-1, 1), zeros_array_train))
y_test_combined = np.hstack((y_test.reshape(-1, 1), zeros_array_test))
y_train = scaler.inverse_transform(y_train_combined)[:, 0]
y_test = scaler.inverse_transform(y_test_combined)[:, 0]

# Align the test predictions with their dates in a dataframe
date_index = test_data.index[n_steps:]
predictions_df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': test_predict.flatten()}, index=date_index)
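The zero-padding trick used above can be verified on a toy dataset. The array below is purely illustrative, not the project's data; the first column plays the role of the scaled price.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy 3-feature dataset; column 0 stands in for the adjusted close
data = np.array([[10.0, 1.0, 100.0],
                 [20.0, 2.0, 200.0],
                 [30.0, 3.0, 300.0]])
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)

# Pretend the model predicted the scaled first column perfectly
pred_scaled = scaled[:, [0]]  # shape (3, 1)

# Pad with zeros for the other features, invert, keep only column 0
padding = np.zeros((pred_scaled.shape[0], data.shape[1] - 1))
recovered = scaler.inverse_transform(np.hstack((pred_scaled, padding)))[:, 0]
print(recovered)  # → [10. 20. 30.]
```

The padded zeros invert to each remaining feature's minimum, but since we discard every column except the first, only the price column needs to be correct.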
Now we have all of our data formatted correctly and our predictions stored in the predictions_df dataframe. Our next step is to visualize our findings.
Let's start by comparing our predicted adjusted closing prices with the actual ground truth prices. This is shown in the graph below.
plt.figure(figsize=(12, 6))
plt.plot(predictions_df.index, predictions_df['Actual'], label='Actual Prices', color='b')
plt.plot(predictions_df.index, predictions_df['Predicted'], label='Predicted Prices', color='r')
plt.title('Actual vs. Predicted Stock Prices by Date')
plt.xlabel('Date')
plt.ylabel('Stock Price')
plt.legend()
plt.grid(True)
plt.show()
We can see that the predictions follow a similar trend to the actual values, although the fit is not as close as we would hope. This stems from the limitations of the LSTM model as well as other factors, discussed below in the insights section.
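A numeric error summary complements the visual comparison. This is a minimal sketch assuming a predictions_df with 'Actual' and 'Predicted' columns; the small dataframe here is an illustrative stand-in, not our real predictions.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the predictions_df built earlier
predictions_df = pd.DataFrame({'Actual':    [100.0, 102.0, 101.0, 105.0],
                               'Predicted': [ 99.0, 103.0, 100.0, 107.0]})

errors = predictions_df['Predicted'] - predictions_df['Actual']
rmse = np.sqrt((errors ** 2).mean())  # penalizes large misses more heavily
mae = errors.abs().mean()             # average absolute miss, in price units
print(f"RMSE: {rmse:.2f}, MAE: {mae:.2f}")  # → RMSE: 1.32, MAE: 1.25
```

Reporting both is useful: a large gap between RMSE and MAE signals that a few big misses, rather than uniform error, dominate the model's inaccuracy.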
We can also visualize our loss during training and validation over each training epoch, as shown below.
plt.plot(range(1,91),epoch_loss,label='Training Loss')
plt.xlabel("Epoch")
plt.ylabel('Training Loss')
plt.title('Training Loss over Epochs')
plt.legend()
plt.show()
plt.plot(range(1,91),epoch_val_loss, label='Validation Loss', color='r')
plt.xlabel("Epoch")
plt.ylabel('Validation Loss')
plt.title('Validation Loss over Epochs')
plt.legend()
plt.show()
We can see a clear downward trend in both of these graphs, indicating that the model was learning throughout the process and its predictions were getting better. The loss indicates how far the predicted values deviate from the actual ground truth values. The training loss is the loss calculated using the training data, whereas the validation loss shows the loss when the partially trained model is tested on a small portion of the data, called validation data.
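Concretely, assuming the model was compiled with mean squared error (a common choice for this kind of regression), each reported loss value is just the mean of the squared prediction errors on the scaled data. The values below are illustrative, not taken from our training run.

```python
import numpy as np

# Scaled ground truth and predictions for one small batch (illustrative)
y_true = np.array([0.50, 0.62, 0.71])
y_pred = np.array([0.48, 0.65, 0.70])

# Mean squared error: average of the squared deviations
mse = np.mean((y_pred - y_true) ** 2)
print(round(mse, 6))  # → 0.000467
```

Because the loss is computed on scaled values in [0, 1], the small magnitudes in the training log (around 0.002-0.005) do not directly correspond to dollar amounts.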
Now that we have visualized our predictions and the loss, we can make certain insights on what all of this means, and how it is relevant to our project.
Through the creation of our model, we found that using the close values of companies directly involved in the creation of Magnificent 7 products gives a valid predictor for the close value of the stock. Although our model is rudimentary and not a strong predictor of the stock's value, it generally follows the up-turns and down-turns of the stock closely enough to be a stronger predictor than the mean. Through this project we learned that capturing and predicting the value of a stock with real accuracy requires a level of analysis and information far beyond that of our simple model. There were, additionally, a number of limitations that affected our ability to create a stronger predictor. For example, the fact that our moving-average feature was limited to the past year greatly restricted the amount of training data we could build on. Additionally, the model we used (an LSTM) is highly susceptible to memorizing noise in a dataset, and since we only used a few features it is very likely that our model overfit, to the detriment of our accuracy.
If we conducted further analysis, we would look for other indicators involved in the price fluctuations of the Magnificent 7 stock, by exploring general market trends and online sentiment analysis, and we would invest in the stock using our model to test whether extended use is indeed profitable. In all, though, we utilized each part of the data science pipeline, from cleaning to feature engineering to machine learning analysis, to create a functional Recurrent Neural Network that gives investors an edge in Magnificent 7 investment.